Trying out community detection. At this point we can only do this on the STRING-based weighted edges generated in this notebook.
In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
# Some nice default configuration for plots
plt.rcParams['figure.figsize'] = 10, 7.5
plt.rcParams['axes.grid'] = True
plt.gray()
In [2]:
import pandas as pd
In [3]:
cd ../../forGAVIN/mergecode/OUT/
In [4]:
!git annex unlock ../../../HBP/testdata/edgelist_update_weighted.txt
In [5]:
cp edgelist_update_weighted.txt ../../../HBP/testdata/
In [6]:
cd ../../../HBP/
Community detection code from Colin written in C++:
In [292]:
!make
This is how to run it:
In [7]:
!./run
Output is written to files in the OUT directory, so we have to unlock them first.
In [7]:
!git annex unlock OUT/*
In [8]:
!./run -file testdata/edgelist_update_weighted.txt -weighted 0 -algorithm 3 -quite 1
This file describes the communities produced:
In [9]:
!head -n 10 OUT/communityout.txt
In [10]:
!git annex unlock unweightedOUT/*
In [11]:
cp OUT/* unweightedOUT/
In [12]:
!./run -file testdata/edgelist_update_weighted.txt -weighted 1 -algorithm 3 -quite 1
We can provide a measure of the similarity of the communities detected using a Normalised Mutual Information (NMI) test. As Wikipedia puts it:
mutual information measures the information that X and Y share: it measures how much knowing one of these variables reduces uncertainty about the other.
Scikit-learn provides this test; all that's required is to load the data into Python.
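For reference, the quantity being computed is the mutual information between the two community assignments, normalised by their entropies. The exact normalisation varies (arithmetic or geometric mean of the entropies, depending on convention and sklearn version), but it has the general form
$$\mathrm{NMI}(U,V) = \frac{2\,I(U;V)}{H(U)+H(V)},$$
so identical partitions score 1 and independent partitions score close to 0.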
In [13]:
import csv
In [14]:
def parsecommunityout(fname):
    # parse the node -> community assignments from the C++ output file
    f = open(fname)
    c = csv.reader(f, delimiter=" ")
    communitydict = {}
    for l in c:
        # keep only the lines that assign a node to a community (they start with "we")
        if l[0] == "we":
            communitydict[l[-1].strip("()")] = int(l[5])
    sortedids = sorted(communitydict.keys(), key=int)
    communities = map(communitydict.__getitem__, sortedids)
    return sortedids, communities
In [15]:
wids,wcom = parsecommunityout("OUT/communityout.txt")
In [16]:
uwids,uwcom = parsecommunityout("unweightedOUT/communityout.txt")
In [17]:
import sklearn.metrics
In [18]:
sklearn.metrics.normalized_mutual_info_score(wcom,uwcom)
Out[18]:
But this result might not be correct, as the class labels sorting the nodes into communities are arbitrary.
Again, the code below is not mine:
In [19]:
from __future__ import division
import collections, re, math, sys, csv
In [20]:
#--- helper function
def convertStr(s):
    ret = int(s)
    return ret
In [21]:
rows, new, dic = [], [], []
consensus = "OUT/communities_cytoscape.txt"
community = "unweightedOUT/communities_cytoscape.txt"
#--- read in first community file
with open(consensus, 'r') as csvfile:
    reader = csv.reader(csvfile, delimiter='=')
    headerline = reader.next();
    for row in reader:
        new.append(row)
#--- read in second community file
with open(community, 'r') as csvfile:
    reader = csv.reader(csvfile, delimiter='=')
    headerline = reader.next();
    for row in reader:
        dic.append(row)
In [22]:
!git annex unlock OUT/community-metric_nmi.txt
In [23]:
for i in dic:
    for j in new:
        if i[0] == j[0]:
            i.append(j[1])
dic.sort(key=lambda x: int(x[2]))
#--- count of nodes in each community in reality and in algorithm
realc, algc = [], []
for i in dic:
    realc.append(convertStr(i[1]))
    algc.append(convertStr(i[2]))
comcount = dict([(i,realc.count(i)) for i in set(realc)])
pcount = dict([(i,algc.count(i)) for i in set(algc)])
#--- holds the algorithm community label with the corresponding community labels
#    provided e.g., {1:{2:2,3:3}}
algcount, tempd = {}, {}
templist = []
comtrack = 0
for key, val in pcount.iteritems():
    templist = realc[comtrack:comtrack+val]
    comtrack += val
    for i in set(templist):
        tempd[i] = templist.count(i)
    algcount[key] = tempd
    tempd = {}
#--- finding the node with the maximum label in that community
#    key = int, value = dictionary
label = {}
for key, value in algcount.iteritems():
    label[key] = max(value, key=value.get)
for i in dic:
    for j in label:
        if (int)(i[2]) == j:
            i.append(label[j])
#--- METRICS
#--- Calculate the NMI
nt, np, ntp = {}, {}, {}
NMI = 0
nodes = len(dic)
for i in range(1, max(comcount)+1):
    for j in range(1, max(pcount)+1):
        nt[i] = 0
        np[j] = 0
        ntp[(str(i)+' '+str(j))] = 0
nt = comcount
np = pcount
for i in dic:
    ntp[(str(i[1]).strip()+' '+str(i[2]).strip())] += 1
num, den = 0, 0
for i, t in nt.iteritems():
    for j, p in np.iteritems():
        temp, temp2 = 0, 0
        if (ntp[(str(i).strip()+' '+str(j).strip())] > 0) and (t > 0) and (p > 0):
            temp = ((ntp[(str(i).strip()+' '+str(j).strip())]*nodes)/(t*p))
            temp2 = ntp[(str(i).strip()+' '+str(j).strip())]* math.log(temp)
            num += temp2
d1, d2 = 0, 0
for t in nt.values():
    d1 += t * math.log(t/nodes)
for p in np.values():
    d2 += p * math.log(p/nodes)
den = d1 + d2
NMI = -(2*num)/den
#--- print results to screen
print 'NMI = %s' %(NMI)
f = open('OUT/community-metric_nmi.txt', 'w')
f.writelines("node\treal\talg\tnewlabel\n")
f.writelines("%s\t%s\t%s\t%s\n" % (i[0],i[1],i[2],i[3]) for i in dic)
f.writelines('NMI = %s' %(NMI))
f.close()
So the result is almost exactly the same, but the above code appears to take into account the problem with arbitrary community labels. Good to see both methods getting similar results, in any case.
In [24]:
dew = pd.read_csv("OUT/permute_p_values_disease.csv", delimiter="\t")
deuw = pd.read_csv("unweightedOUT/permute_p_values_disease.csv", delimiter="\t")
Check papers to read about the hypergeometric disease enrichment test:
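As a reminder of the general shape of that test (a minimal sketch with scipy and made-up numbers, not the implementation that produced these CSVs): with M genes in the network, n of them annotated to a disease, and a community of N genes containing k disease genes, the enrichment p-value is the probability of seeing at least k disease genes in the community by chance.
from scipy import stats
# hypothetical counts: 2000 genes overall, 50 annotated to the disease,
# and a community of 40 genes containing 6 of them
M, n, N, k = 2000, 50, 40, 6
pval = stats.hypergeom.sf(k - 1, M, n, N)  # P(X >= k)
print pval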
In [25]:
dew.head()
Out[25]:
In [26]:
deuw.head()
Out[26]:
How to interpret these results? I suppose we would like to split them by disease and then find the entries with low p-values, probably by sorting. All of the disease columns are regularly spaced, so it should be easy to split them into separate dataframes:
In [27]:
def splitbydisease(diseasedataframe):
    diseasedict = {}
    clabels = diseasedataframe.iloc[:,[0,1,3,4,5,6,7]].columns
    for offset in range(2,diseasedataframe.columns.shape[0]-1,6):
        diseaseindexes = [0,1] + range(offset+1,offset+6)
        diseasedict[diseasedataframe.columns[offset]] = diseasedataframe.iloc[:,diseaseindexes]
        #fix the column names
        diseasedict[diseasedataframe.columns[offset]].columns = clabels
    return diseasedict
In [28]:
deuwdict = splitbydisease(deuw)
In [29]:
dewdict = splitbydisease(dew)
Now we can search for diseases in each of these communities with p-values less than 5%.
In [30]:
def significantdiseases(diseasedict):
    sigdiseasedict = {}
    for k in diseasedict:
        pvalues = diseasedict[k].iloc[:,2]
        pvalues = pvalues[pvalues<0.05]
        sigdiseasedict[k] = diseasedict[k].iloc[pvalues.index]
    return sigdiseasedict
In [31]:
sigdeuwdict = significantdiseases(deuwdict)
In [32]:
sigdewdict = significantdiseases(dewdict)
Now we want to iterate through the diseases and sort by the lowest p-values, while annotating which disease and community is involved. Should probably build a new dataframe and put this information in it.
To keep the analysis simple, we only want to consider schizophrenia and Alzheimer's disease.
In [33]:
interestdiseases = ["Alzheimer's_disease",'schizophrenia']
In [34]:
for k in sigdeuwdict:
    sigdeuwdict[k].insert(2,'disease',k)
pieces = map(lambda k: sigdeuwdict[k],interestdiseases)
In [35]:
sigunweighted = pd.concat(pieces)
In [36]:
for k in sigdewdict:
    sigdewdict[k].insert(2,'disease',k)
pieces = map(lambda k: sigdewdict[k],interestdiseases)
In [37]:
sigweighted = pd.concat(pieces)
In [38]:
sigunweighted.sort(columns='p-value')
Out[38]:
In [39]:
sigweighted.sort(columns='p-value')
Out[39]:
Now all that's required is to retrieve the IDs of the proteins in each of these communities to compare the different disease-associated communities. Wait, Cytoscape already does that for me. I can just use that.
Writing these results to a file so I can discuss them later:
In [40]:
!git annex unlock unweighted_significant_disease_communities.tsv
In [41]:
sigunweighted.sort(columns='p-value').to_csv("unweighted_significant_disease_communities.tsv",
sep="\t",index=False)
In [42]:
!git annex unlock weighted_significant_disease_communities.tsv
In [43]:
sigweighted.sort(columns='p-value').to_csv("weighted_significant_disease_communities.tsv",
sep="\t",index=False)
In [44]:
!head unweighted_significant_disease_communities.tsv
The obvious question to ask is what the crossover is between the high-confidence disease communities in each of these groupings. We can find this by comparing the members of each of these communities. While doing the NMI test we loaded these communities, so we can access and compare them easily.
In [45]:
weightedcommunities = {}
for l in zip(*parsecommunityout("OUT/communityout.txt")):
    try:
        weightedcommunities[l[1]] += [l[0]]
    except KeyError:
        weightedcommunities[l[1]] = [l[0]]
In [46]:
unweightedcommunities = {}
for l in zip(*parsecommunityout("unweightedOUT/communityout.txt")):
    try:
        unweightedcommunities[l[1]] += [l[0]]
    except KeyError:
        unweightedcommunities[l[1]] = [l[0]]
In [47]:
def findsimilarcom(community,communitydict):
    overlapdict = {}
    for k in communitydict:
        #make two lists of inclusion/exclusion for all shared ids
        uwids = community
        wids = communitydict[k]
        tids = set(uwids+wids)
        winc,uwinc = [],[]
        for i in tids:
            winc.append(int(i in wids))
            uwinc.append(int(i in uwids))
        #fraction of the union found in both communities (a Jaccard-style overlap)
        overlap = sum(map(lambda a,b: int(a==b), uwinc,winc))/len(tids)
        if overlap > 0.0:
            overlapdict[k] = overlap
    return overlapdict
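A quick toy check of the overlap measure above (not part of the original analysis): two communities sharing two members out of four in total should score 0.5.
print findsimilarcom(['a','b','c'], {1: ['b','c','d']})  # {1: 0.5}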
In [48]:
findsimilarcom(unweightedcommunities[29],weightedcommunities)
Out[48]:
So many of these proteins are spread around other, smaller communities in the weighted case, but most are found in another community which overlaps by 31%.
In [49]:
len(weightedcommunities[33]),len(unweightedcommunities[29])
Out[49]:
Investigating the effect of the weights: we would like to know which weightings might have caused the community to be split like this. We can inspect the graphs more easily by plotting them:
In [50]:
#load the edges of the graph
#(import the numpy functions that are used without the np prefix in this and later cells)
from numpy import loadtxt, linspace, array, average, savez
interactions = loadtxt("testdata/edgelist_update_weighted.txt",dtype=str)
interactions = interactions[1:]
In [51]:
import networkx as nx
In [52]:
def plotcommunities(com1,com2):
    G = nx.Graph()
    for l in interactions:
        if l[0] in set(com1+com2) and l[1] in set(com1+com2):
            G.add_edge(l[0],l[1],weight=float(l[2]))
    edict = {}
    lim = min([d['weight'] for (u,v,d) in G.edges(data=True)])
    diff = linspace(lim,1.0,10)[1] - linspace(lim,1.0,10)[0]
    for x in linspace(lim,1.0,10):
        edict[x] = [(u,v) for (u,v,d) in G.edges(data=True) if d['weight'] > x and d['weight'] < x+diff]
    pos = nx.circular_layout(com1)
    pos2 = nx.circular_layout(set(com2)-set(com1))
    for k in pos2:
        pos2[k] = array([pos2[k][0]+1.5,pos2[k][1]])
    pos = dict(pos.items()+pos2.items())
    nx.draw_networkx_nodes(G,pos,node_size=20,alpha=0.5)
    for k in edict:
        nx.draw_networkx_edges(G,pos,edgelist=edict[k],alpha=(k-lim)*(1/(1-lim)),edge_color='r')
        #nx.draw_networkx_edges(G,pos,edgelist=edict[k],edge_color='r')
    l = nx.draw_networkx_labels(G,pos,font_size=10,font_family='sans-serif')
    return None
In [53]:
plotcommunities(weightedcommunities[33],unweightedcommunities[29])
In [80]:
import os
In [81]:
plotpath = os.path.abspath("../plots/bayes/")
In [82]:
savez(os.path.join(plotpath,"nx2933.npz"),unweightedcommunities[29],weightedcommunities[33])
In [54]:
plotcommunities(unweightedcommunities[29],weightedcommunities[33])
Average weighting of the edges within each community:
In [56]:
def averageweight(community):
    weights = []
    for l in interactions:
        if l[0] in community and l[1] in community:
            weights.append(float(l[2]))
    return average(weights)
In [57]:
averageweight(weightedcommunities[33])
Out[57]:
In [58]:
averageweight(unweightedcommunities[29])
Out[58]:
In [59]:
s=findsimilarcom(weightedcommunities[44],unweightedcommunities)
s
Out[59]:
So the proteins involved are split among some larger communities in the unweighted case, mostly in community 6. We can plot these:
In [60]:
plotcommunities(weightedcommunities[44],unweightedcommunities[64])
In [83]:
savez(os.path.join(plotpath,"nx6444.npz"),unweightedcommunities[64],weightedcommunities[44])
In [61]:
plotcommunities(unweightedcommunities[64],weightedcommunities[44])
In [62]:
schw = pd.read_csv("OUT/permute_p_values_SCH_enriched.csv", delimiter="\t")
schuw = pd.read_csv("unweightedOUT/permute_p_values_SCH_enriched.csv", delimiter="\t")
In [63]:
sigschwdict = significantdiseases(splitbydisease(schw))
sigschuwdict = significantdiseases(splitbydisease(schuw))
In [64]:
for k in sigschwdict:
    sigschwdict[k].insert(2,'disease',k)
pieces = map(lambda k: sigschwdict[k],sigschwdict)
In [65]:
sigschweighted = pd.concat(pieces)
In [66]:
for k in sigschuwdict:
    sigschuwdict[k].insert(2,'disease',k)
pieces = map(lambda k: sigschuwdict[k],sigschuwdict)
In [67]:
sigschunweighted = pd.concat(pieces)
In [68]:
sigschweighted.sort(columns='p-value')
Out[68]:
In [69]:
sigschunweighted.sort(columns='p-value')
Out[69]:
In [72]:
!git annex unlock unweighted_significant_SCH_communities.tsv
In [73]:
!git annex unlock weighted_significant_SCH_communities.tsv
In [74]:
#save these tables
sigschunweighted.sort(columns='p-value').to_csv("unweighted_significant_SCH_communities.tsv",
sep="\t",index=False)
sigschweighted.sort(columns='p-value').to_csv("weighted_significant_SCH_communities.tsv",
sep="\t",index=False)
In [75]:
overlap = pd.read_csv("unweightedOUT/overlap_dis_SCHenriched.csv", sep="\t",header=None,names=range(139))
In [76]:
overlap = overlap.fillna('')
We are only interested in the overlap with schizophrenia, so we want to cut out just that column.
In [77]:
overlap = overlap.iloc[:,[0,1,108]]
In [78]:
with open("unweightedOUT/overlap_dis_SCHenriched.csv") as f:
c = csv.reader(f,delimiter="\t")
overlap = {}
for l in c:
try:
if l[0] == 'community':
currentcommunity = l[1]
if len(l) > 108:
if l[108] == '1':
try:
overlap[currentcommunity] += [l[0]]
except KeyError:
overlap[currentcommunity] = [l[0]]
except:
pass
Ok, so we now have access to this data.
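For example, one way to summarise it (a quick sketch on top of the overlap dict built above, which maps each community to the list of gene ids overlapping the schizophrenia column):
# communities with the most genes overlapping schizophrenia
for k in sorted(overlap, key=lambda c: len(overlap[c]), reverse=True)[:5]:
    print k, len(overlap[k])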